Linear regression models assume a linear relationship between one or more predictor variables \(X_i\) and a response variable \(Y\).
\[ \begin{align} y_i & \sim Normal(\mu_i, \sigma) \\ \mu_i & = \beta_0 + \beta_1 x_i \end{align} \]
The first element of this model describes how the response variable \(Y\) can be modeled as a normal random variable with mean \(\mu_i\) and standard deviation \(\sigma\).
The second element of this model adds the effect of the predictor variable \(X\) on \(Y\), based on the causal relationship \(X \rightarrow Y\). In this simple case, the relationship is linear and can be described by the equation of a line with intercept \(\beta_0\) and slope \(\beta_1\).
In the introductory lecture for linear models, we used an example dataset to fit a simple regression model.
The dataset itself was generated according to the specifications of the model above. We are going to recreate the dataset and some of the calculations presented during the class.
set.seed(4) # ensures the result is reproducible. Random number generation can vary between operating systems, so the datasets generated in the classroom may differ
x_i <- seq(1, 5, by = 0.5)
n <- length(x_i)
mu <- 1.2 + 3.5 * x_i
y_i <- rnorm(n = n, mean = mu, sd = 3)
lm_data <- data.frame(x_i, mu, y_i, res = y_i - mu)
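Before plotting, you can take a quick look at the first rows of the simulated dataset (the x values, the true means, the responses, and the true residuals):
head(lm_data)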
Create a plot for the dataset:
plot(y_i ~ x_i)
As you can see, plotting the data suggests a linear relationship between these two variables.
Let’s calculate the coefficients for the regression line according to the Ordinary Least Squares method:
\[\widehat{\beta_1}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\]
and
\[ \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x} \]
x <- x_i
y <- y_i
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x)) ^ 2)
beta1_hat
[1] 3.84557
beta0_hat <- mean(y) - beta1_hat * mean(x)
beta0_hat
[1] 1.459434
Add a regression line with the values of the coefficients you calculated
When we generated the dataset above, \(\mu\) was specified as:
mu <- 1.2 + 3.5 * x_i
This means \(\beta_0 = 1.2\) and \(\beta_1 = 3.5\). How do these values compare to the coefficients you just calculated?
Add a population line to the plot you previously created by completing the code below:
plot(y ~ x)
abline(a = beta0_hat, b = beta1_hat, col = "red")
abline(a = ..., b = ..., col = "blue")
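The blue line is the population (true) line. Since the data were simulated with mu <- 1.2 + 3.5 * x_i, the completed call is:
abline(a = 1.2, b = 3.5, col = "blue") # population line with the true beta0 and beta1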
Function lm() fits regression models using Ordinary Least Squares.
The values of the coefficients can be found using function coef().
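As a minimal sketch (assuming the model object is called mod, the name used in the diagnostic plots below):
mod <- lm(y_i ~ x_i, data = lm_data) # fit the regression by Ordinary Least Squares
coef(mod)                            # estimated intercept (beta0) and slope (beta1)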
Compare this result with the calculations you did previously.
\(SS_y\), \(RSS\), and \(SS_{Reg}\)
\[RSS = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 \]
\[SS_y = {\sum_{i=1}^{n}(y_i-\bar{y})^2}\]
\[SS_{Reg} = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2 \]
\[SS_y = SS_{Reg} + RSS\]
Challenge: Given that you have already learned that the \(R^2\) of a model is the ratio between the sums of squares of the regression and the total sum of squares, calculate it and check the result comparing with the summary of the model.
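One way to work through the challenge, as a sketch that assumes the mod object fitted above:
y_hat <- fitted(mod)                 # predicted values
SS_y <- sum((y_i - mean(y_i))^2)     # total sum of squares
RSS <- sum((y_i - y_hat)^2)          # residual sum of squares
SS_reg <- sum((y_hat - mean(y_i))^2) # regression sum of squares
SS_reg / SS_y                        # R^2 as the ratio of the sums of squares
summary(mod)$r.squared               # R^2 reported by the model summary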
Before making inferences with your fitted model, it is also good practice to run some model diagnostics. For linear regression, the most basic diagnostic is to check whether the residuals follow a normal distribution with zero mean and a constant standard deviation.
Because the normal distribution has a parameter that corresponds to the expected value, and this parameter is independent of the other one (\(\sigma\)), the linear regression model we have been writing as:
\[ \begin{align} y_i & \sim Normal(\mu_i, \sigma) \\ \mu_i & = \sum_{j=0}^{J} \beta_{j} x_{ij} \\ \sigma & = C \end{align} \]
can also be written as:
\[ \begin{align} y_i & = \sum_{j=0}^{J} \beta_j x_{ij} + \epsilon_i \\ \epsilon_i & \sim Normal(\mu, \sigma) \\ \mu & = 0 \\ \sigma & = C \end{align} \]
That is, the values of the response \(y_i\) are each a weighted sum of the predictor variables \(x_{ij}\) plus a random Gaussian residual \(\epsilon_i\). To express the random variation symmetrically around the expected value, this residual has a mean of zero and a fixed standard deviation.
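A quick way to see this in practice (a sketch using the mod object fitted earlier) is to extract the residuals and check that they are centered on zero:
res_hat <- residuals(mod) # residuals of the fitted model
mean(res_hat)             # essentially zero for an OLS fit with an intercept
sd(res_hat)               # a rough estimate of the residual standard deviation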
To check whether this assumption holds, we plot the residuals of the fitted model as a function of the predicted values. This plot should show a point cloud of constant width around zero on the y-axis. You can check this assumption by applying the function plot() to the object that stores the fitted model:
plot(mod, which = 1)
You can also check the normality of the residuals with a quantile-quantile (QQ) plot. Normally distributed data should lie on a straight line in this plot:
plot(mod, which = 2)
What is your overall evaluation of the normality of the residuals?
You can find an excellent explanation of these and other diagnostic plots for linear regression here.